Regression test updates: global_4dvar bug fix, oom fix, enhance error checking #532

Merged


RussTreadon-NOAA
Contributor

@RussTreadon-NOAA RussTreadon-NOAA commented Feb 11, 2023

Description
Regression tests run against develop found that the global_4dvar ctest segfaulted while gsi.x was executing the Lanczos solver. This was traced to the test using the wrong anavinfo file. Other issues with the regression tests were found during this investigation.

The Orion job configuration for rrfs_3denvar_glbens requested too few resources, and the job aborted with an out-of-memory (OOM) error. The Hera job configuration requested more resources, so the Orion configuration was updated to be consistent with Hera.

Error checking was enhanced in the regression test scripts. Log files are now retained if a ctest fails.
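
As a rough illustration of this kind of error check, here is a minimal sketch and not the code from this PR. The function name `run_step` and the log handling are assumptions made for the example; the actual regression scripts may be organized differently.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the GSI regression scripts):
# run a command, capture its output in a log file, and keep the log
# only when the command fails so the failure can be inspected later.
run_step() {
    log=$1
    shift
    "$@" > "$log" 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "step failed with status $rc; log retained at $log" >&2
        return "$rc"
    fi
    # Success: the log is no longer needed.
    rm -f "$log"
    return 0
}
```

Called as `run_step /tmp/step.log some_command arg1`, the helper returns the command's exit status, so a caller can stop the test at the first failing step.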

Fixes #531

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?
The full suite of 9 ctests will be run on Hera, Orion, and WCOSS2. Test results will be documented in this PR.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing tests pass with my changes

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Feb 11, 2023
@RussTreadon-NOAA
Contributor Author

Below are ctest results on various platforms

WCOSS2 (Dogwood)

russ.treadon@dlogin07:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr532/build> ctest -j 9
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr532/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  1202.62 sec
2/9 Test #7: rrfs_3denvar_glbens ..............   Passed  1204.93 sec
3/9 Test #4: hwrf_nmm_d2 ......................   Passed  1206.22 sec
4/9 Test #9: global_enkf ......................   Passed  1208.29 sec
5/9 Test #5: hwrf_nmm_d3 ......................   Passed  1212.17 sec
6/9 Test #6: rtma .............................   Passed  1689.50 sec
7/9 Test #1: global_3dvar .....................   Passed  1981.94 sec
8/9 Test #3: global_4denvar ...................***Failed  1982.05 sec
9/9 Test #2: global_4dvar .....................   Passed  1982.06 sec

78% tests passed, 2 tests failed out of 9

Total Test time (real) = 1982.07 sec

The following tests FAILED:
          3 - global_4denvar (Failed)
          8 - netcdf_fv3_regional (Failed)
Errors while running CTest

The global_4denvar failure is due to the scalability check.

The case has Failed the scalability test.
The slope for the update (59.493756 seconds per node) is less than that for the control (98.365727 seconds per node).

The netcdf_fv3_regional failure is due to the hardware memory limit check.

The memory for netcdf_fv3_regional_loproc_updat is 203992 KBs.  This has exceeded maximum allowable memory of 195100 KBs,
resulting in Failure memthresh of the regression test.
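
The comparison behind this message can be sketched as follows. This is a simplified illustration using the two values from the log above; the function name `check_memory` is hypothetical and is not taken from the regression scripts.

```shell
#!/bin/sh
# Hypothetical threshold check (illustrative only): compare the memory
# used by a run, in KB, against the maximum allowable memory, in KB.
check_memory() {
    used_kb=$1
    max_kb=$2
    if [ "$used_kb" -gt "$max_kb" ]; then
        echo "memthresh exceeded by $((used_kb - max_kb)) KB"
        return 1
    fi
    echo "memory within limit"
    return 0
}
```

With the values reported above, `check_memory 203992 195100` reports an overshoot of 8892 KB and returns a nonzero status.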

Neither of these is a fatal failure.

Orion

Orion-login-2:/work/noaa/da/rtreadon/git/gsi/pr532/build$ ctest -j 9
Test project /work/noaa/da/rtreadon/git/gsi/pr532/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #9: global_enkf ......................   Passed  486.37 sec
2/9 Test #8: netcdf_fv3_regional ..............   Passed  604.57 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  606.59 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  608.80 sec
5/9 Test #5: hwrf_nmm_d3 ......................***Failed  736.55 sec
6/9 Test #6: rtma .............................   Passed  1211.83 sec
7/9 Test #3: global_4denvar ...................   Passed  1562.45 sec
8/9 Test #2: global_4dvar .....................   Passed  1741.73 sec
9/9 Test #1: global_3dvar .....................   Passed  1926.41 sec

89% tests passed, 1 tests failed out of 9

Total Test time (real) = 1926.41 sec

The following tests FAILED:
          5 - hwrf_nmm_d3 (Failed)
Errors while running CTest
Output from these tests are in: /work/noaa/da/rtreadon/git/gsi/pr532/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

The hwrf_nmm_d3 failure is due to the scalability check.

The case has Failed the scalability test.
The slope for the update (45.569103 seconds per node) is less than that for the control (203.662483 seconds per node).

This is not a fatal failure.

Hera

Hera(hfe09):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr532/build$ ctest -j 9
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr532/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #7: rrfs_3denvar_glbens ..............   Passed  607.66 sec
2/9 Test #9: global_enkf ......................   Passed  842.06 sec
3/9 Test #8: netcdf_fv3_regional ..............   Passed  845.70 sec
4/9 Test #5: hwrf_nmm_d3 ......................   Passed  922.30 sec
5/9 Test #4: hwrf_nmm_d2 ......................   Passed  1030.66 sec
6/9 Test #2: global_4dvar .....................   Passed  1682.02 sec
7/9 Test #6: rtma .............................   Passed  1873.20 sec
8/9 Test #3: global_4denvar ...................   Passed  2088.57 sec
9/9 Test #1: global_3dvar .....................   Passed  2328.40 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 2328.41 sec

@RussTreadon-NOAA RussTreadon-NOAA marked this pull request as ready for review February 11, 2023 02:56
@RussTreadon-NOAA RussTreadon-NOAA added the bug Something isn't working label Feb 11, 2023
Collaborator

@hu5970 hu5970 left a comment

The changes all look good.
I made two minor comments on adjusting indentation to make the code easier to read.

regression/regression_test_enkf.sh (two review comments, now outdated)
hu5970 previously approved these changes Feb 11, 2023
@RussTreadon-NOAA
Contributor Author

Reran the global_enkf test on Hera after 74cad42 to ensure that regression_test_enkf.sh still works as it should. The ctest failed because the conditional test changed in regression_test_enkf.sh at 74cad42 had the wrong syntax: single brackets, not double brackets, should be used. Fixed the local copy and reran global_enkf. This time the test passed.
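
The exact conditional is not shown in this thread, so the sketch below uses a hypothetical function and argument purely to illustrate the single-bracket form. `[ ... ]` is the POSIX `test` command and works in any `sh`, whereas `[[ ... ]]` is a bash/ksh extension with different parsing rules; which of these differences caused the failure here is not stated.

```shell
#!/bin/sh
# Hypothetical check (names are illustrative, not from
# regression_test_enkf.sh): report whether an exit status is zero.
status_is_zero() {
    # Single brackets invoke the portable POSIX `test` builtin.
    if [ "$1" -eq 0 ]; then
        echo "passed"
    else
        echo "failed"
    fi
}
```

Quoting `"$1"` keeps the test well-formed even when the argument is empty, which is a common source of syntax errors with single brackets.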

Hera(hfe10):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr532/build$ ctest -R global_enkf
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr532/build
    Start 9: global_enkf
1/1 Test #9: global_enkf ......................   Passed  1179.25 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1179.26 sec

Committed the correction at b6c56f2.

@RussTreadon-NOAA
Contributor Author

@hu5970 , thanks! This is above and beyond responsiveness!

@RussTreadon-NOAA RussTreadon-NOAA merged commit e55a937 into NOAA-EMC:develop Feb 11, 2023
@RussTreadon-NOAA RussTreadon-NOAA deleted the feature/global_4dvar branch February 11, 2023 19:05
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix failed global_4dvar ctest
2 participants